Data dependency collapsing hardware apparatus

ABSTRACT

A multi-function ALU (arithmetic/logic unit) for use in digital data processing facilitates the execution of instructions in parallel, thereby enhancing processor performance. The proposed apparatus reduces the instruction execution latency that results from data dependency hazards in a pipelined machine. This latency reduction is accomplished by collapsing the interlocks due to these hazards. The proposed apparatus achieves performance improvement while maintaining compatibility with previous implementations designed using an identical architecture.

This application is a file wrapper continuation of U.S. application Ser. No. 07/931,624, filed Aug. 18, 1992, now abandoned.

BACKGROUND OF THE INVENTION

This invention relates to the execution of scalar instructions in a scalar machine. More particularly, the invention concerns the parallel execution of scalar instructions when one of the instructions uses as an operand a result produced by a concurrently-executed instruction.

Pipelining is a standard technique used by computer designers to improve the performance of computer systems. In pipelining, an instruction is partitioned into several steps or stages for which unique hardware is allocated to implement the function assigned to that stage. The rate of instruction flow through the pipeline depends on the rate at which new instructions enter the pipe, rather than the pipeline's length. In an idealized pipeline structure where a maximum of one instruction is fed into the pipeline per cycle, the pipeline throughput, a measure of the number of instructions executed per unit time, is dependent only on the cycle time. If the cycle time of an n-stage pipeline implementation is assumed to be m/n, where m is the cycle time of the corresponding implementation not utilizing pipelining techniques, then the maximum potential improvement offered by pipelining is n.
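The idealized relationship above can be checked with a few lines of arithmetic. The following is only a sketch under the stated idealization (one instruction entering per cycle, no hazards); the function name `ideal_pipeline_speedup` is hypothetical and not part of the described apparatus.

```python
def ideal_pipeline_speedup(m: float, n: int) -> float:
    """Ratio of pipelined to unpipelined throughput in the ideal case.

    m is the cycle time of the unpipelined implementation and n is the
    number of pipeline stages; the pipelined cycle time is assumed to
    be m / n, as in the text.
    """
    unpipelined_throughput = 1.0 / m        # instructions per unit time
    pipelined_throughput = 1.0 / (m / n)    # one instruction per shorter cycle
    return pipelined_throughput / unpipelined_throughput


if __name__ == "__main__":
    # A 5-stage pipeline built from a 10 ns unpipelined design.
    print(ideal_pipeline_speedup(10.0, 5))
```

The ratio reduces to n regardless of m, which is why the text describes n as the maximum potential improvement.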

Although the foregoing indicates that pipelining offers the potential of an n-times improvement in computer system performance, several practical limitations cause the actual performance gain to be less than that for the ideal case. These limitations result from the existence of pipeline hazards. A hazard in a pipeline is defined to be any aspect of the pipeline structure that prevents instructions from passing through the structure at the maximum rate. Pipeline hazards can be caused by data dependencies, structural (hardware resource) conflicts, control dependencies and other factors.

Data dependency hazards are often called write-read hazards or write-read interlocks because the first instruction must write its result before the second instruction can read and subsequently use the result. To allow this write before the read, execution of the read must be blocked until the write has occurred. This blockage introduces a cycle of inactivity, often termed a "bubble" or "stall", into the execution of the blocked instruction. The bubble adds one cycle to the overall execution time of the stalled instruction and thus decreases the throughput of the pipeline. If implemented in hardware, the detection and resolution of structural and data dependency hazards may not only result in performance losses due to the under-utilization of hardware but may also become the critical path of the machine. This hardware would then constrain the achievable cycle time of the machine. Hazards, therefore, can adversely affect two factors which contribute to the throughput of the pipeline: the number of instructions executed per cycle; and the cycle time of the machine.

The existence of hazards indicates that the scheduling or ordering of instructions as they enter a pipeline structure is of great importance in attempting to achieve effective use of the pipeline hardware. Effective use of the hardware, in turn, translates into performance gains. In essence, pipeline scheduling is an attempt to utilize the pipeline to its maximum potential by attempting to avoid hazards. Scheduling can be achieved statically, dynamically or with a combination of both techniques. Static scheduling is achieved by reordering the instruction sequence before execution to an equivalent instruction stream that will more fully utilize the hardware than the former. An example of static scheduling is provided in Table I and Table II, in which the interlock between the two Load instructions has been avoided.

                  TABLE I
    ______________________________________
    X1                  ;any instruction
    X2                  ;any instruction
    ADD    R4,R2        ;R4 = R4 + R2
    LOAD   R1,(Y)       ;load R1 from memory location Y
    LOAD   R1,(X[R1])   ;load R1 from memory location X
                        ;function of R1
    ADD    R3,R1        ;R3 = R3 + R1
    LCMP   R1,R4        ;load the 2's complement of (R4) to R1
    SUB    R1,R2        ;R1 = R1 - R2
    COMP   R1,R3        ;compare R1 with R3
    X3                  ;any compoundable instruction
    X4                  ;any compoundable instruction
    ______________________________________

                  TABLE II
    ______________________________________
    X1                  ;any instruction
    X2                  ;any instruction
    LOAD   R1,(Y)       ;load R1 from memory location Y
    ADD    R4,R2        ;R4 = R4 + R2
    LOAD   R1,(X[R1])   ;load R1 from memory location X
                        ;function of R1
    ADD    R3,R1        ;R3 = R3 + R1
    LCMP   R1,R4        ;load the 2's complement of (R4) to R1
    SUB    R1,R2        ;R1 = R1 - R2
    COMP   R1,R3        ;compare R1 with R3
    X3                  ;any compoundable instruction
    X4                  ;any compoundable instruction
    ______________________________________
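The effect of the reordering between Table I and Table II can be illustrated by listing the write-read interlocks between adjacent instructions. The sketch below is an illustration only: the read and write register sets are an assumed reading of the table comments, the unspecified instructions X1 through X4 are omitted, and only immediately adjacent instructions are compared, which is a simplification of a real pipeline.

```python
# Each entry: (mnemonic, register written or None, registers read).
# These sets are an assumed reading of the comments in Tables I and II.
TABLE_I = [
    ("ADD",  "R4", {"R4", "R2"}),   # ADD  R4,R2
    ("LOAD", "R1", set()),          # LOAD R1,(Y)
    ("LOAD", "R1", {"R1"}),         # LOAD R1,(X[R1])
    ("ADD",  "R3", {"R3", "R1"}),   # ADD  R3,R1
    ("LCMP", "R1", {"R4"}),         # LCMP R1,R4
    ("SUB",  "R1", {"R1", "R2"}),   # SUB  R1,R2
    ("COMP", None, {"R1", "R3"}),   # COMP R1,R3
]

# Table II swaps the first ADD and the first LOAD.
TABLE_II = [TABLE_I[1], TABLE_I[0]] + TABLE_I[2:]


def adjacent_interlocks(sequence):
    """Return (writer, reader, register) for each adjacent write-read pair."""
    hits = []
    for first, second in zip(sequence, sequence[1:]):
        written = first[1]
        if written is not None and written in second[2]:
            hits.append((first[0], second[0], written))
    return hits


if __name__ == "__main__":
    print("Table I :", adjacent_interlocks(TABLE_I))
    print("Table II:", adjacent_interlocks(TABLE_II))
```

Under these assumptions the LOAD-to-LOAD pair appears only in the Table I ordering, while the LOAD-ADD, LCMP-SUB and SUB-COMP pairs remain in both orderings.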

While scheduling techniques may relieve some hazards, resulting in performance improvements, not all hazards can be relieved. For data dependencies that cannot be relieved by scheduling, solutions have been proposed. These proposals execute multiple operations in parallel. According to one proposal, an instruction stream is analyzed based on hardware utilization and grouped into a compound instruction for issue as a single unit. This approach differs from a "superscalar machine" in which a number of instructions are grouped strictly on a first-in-first-out basis for simultaneous issue. Assuming the hardware is designed to support the simultaneous issue of two instructions, a compound instruction machine would pair the instruction sequence of Table II as follows: (- X1) (X2 LOAD) (ADD LOAD) (ADD LCMP) (SUB COMP) (X3 X4), thereby avoiding the data dependency between the second LOAD instruction and the second ADD instruction. A comparable superscalar machine, however, would issue the following instruction pairs: (X1 X2) (LOAD ADD) (LOAD ADD) (LCMP SUB) (COMP X3) (X4 -), incurring the penalty of the LOAD-ADD data dependency.

A second solution for the relief of data dependency interlocks has been proposed in Computer Architecture News, March, 1988, by the article entitled "The WM Computer Architecture," by W. A. Wulf. The WM Computer Architecture proposes:

1. architecting an instruction set that imbeds more than one operation into a single instruction;

2. allowing register interlocks within an architected instruction; and

3. concatenating two ALU's as shown in FIG. 1 to collapse interlocks within a single instruction.

Obviously, in Wulf's proposal, new instructions must be architected for all instruction sequence pairs whose interlocks are to be collapsed. This results in either a prohibitive number of opcodes being defined for the new instruction set, or a limit, bounded by the number of opcodes available, being placed upon the number of operation sequences whose interlocks can be collapsed. In addition, this scheme may not be object code compatible with earlier implementations of an architecture. Other drawbacks for this scheme include the requirement of two ALUs whose concatenation can result in the execution of a multiple operation instruction requiring close to twice the execution time of a single instruction. Such an increase in execution time would reflect into an increase in the cycle time of the machine and unnecessarily penalize all instruction executions.

In the case where an existing machine has been architected to sequentially issue and execute a given set of instructions, it would be beneficial to employ parallelism in instruction issuing and execution. Parallel issue and execution would increase the throughput of the machine. Further, the benefits of such parallelism should be maximized by minimization of instruction execution latency resulting from data dependency hazards in the instruction pipeline. Thus, the adaptation to parallelism should provide for the reduction of such latency by collapsing interlocks due to these hazards. However, these benefits should be enjoyed without having to pay the costs resulting from architectural changes to the existing machine, creating a new instruction set to provide all possible instruction pairs and their combinations possessing interlocks, and adding a great deal of hardware. Further, the adaptation should present a modest or no impact on the cycle time of the machine.

SUMMARY OF THE INVENTION

The invention achieves these objectives in providing a computer architected for serial execution of a sequence of scalar operations with an apparatus for simultaneously executing a plurality of scalar instructions in a single machine cycle. The apparatus is one which collapses data dependency between simultaneously-executed instructions, which means that a pair of instructions can be executed even when one of the pair requires as an operand the result produced by execution of the other of the pair of instructions.

In this invention, the apparatus for collapsing data dependency while simultaneously executing a plurality of scalar instructions includes a provision for receiving a plurality of scalar instructions to be concurrently executed and information as to an order of execution of those instructions, a second of the scalar instructions using as an operand the result produced by execution of a first of the scalar instructions. The apparatus further has provision for receiving three operands which are used by the first and second scalar instructions and has a control component connected to the provision for receiving the instructions which generates control signals that indicate operations which execute the plurality of scalar instructions and which indicate the order of their execution. A multi-function ALU is connected to the operands and to the control provisions and responds to the control signals and the operands by producing, in parallel with execution of the first instruction, a single result corresponding to execution of the second instruction.

Viewed from another aspect, the invention is an apparatus which supports simultaneous execution of a plurality of scalar instructions where a result produced by a first of the simultaneously executing instructions is used as an operand in a second of the simultaneously executing instructions. The apparatus executes the second instruction in parallel with execution of the first instruction by provision of a data dependency-collapsing ALU which has provision for receiving three operands which are used by the first and second instruction to provide the result of the second instruction concurrently with the result of the first instruction.

It is therefore a primary object of this invention to provide an apparatus which facilitates the execution of instructions in parallel to increase existing computer performance.

A significant advantage of the apparatus is the reduction of instruction execution latency that results from data dependency hazards existing in the executed instructions.

An objective in this apparatus is to collapse the interlocks due to data dependency hazards existing between instructions which are executed in parallel.

These objectives and advantages are achieved, with a concomitant improvement in performance and instruction execution, by an apparatus which is compatible with the scalar computer designed for sequential execution of the instructions.

The achievement of these and other objectives and advantages will be appreciated when the following detailed description is read with reference to the below-described drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a prior art architecture for execution of an instruction which pairs operations.

FIG. 2 is a set of timing sequences which illustrate pipelined execution of scalar instructions.

FIG. 3 is an illustration of an adder which accepts up to three operands and produces a single result.

FIGS. 4A and 4B illustrate categorization of instructions executed by an existing scalar machine.

FIG. 5 illustrates functions produced by interlocking cases where logical and add-type instructions in category 1 of FIG. 4A are combined.

FIGS. 6A and 6B specify the operations required to be performed on operands by an ALU according to the invention to support instructions contained in compoundable categories in FIGS. 4A and 4B.

FIGS. 7A and 7B summarize the routing of operands to an ALU defined in FIGS. 6A and 6B.

FIG. 8 is a block diagram showing how the invention is used to effect parallel execution of two interlocking instructions.

FIG. 9 is a multi-function ALU defined by FIGS. 6A, 6B, 7A, and 7B.

FIG. 10 illustrates functions requiring implementation to collapse interlocks inherent in hazards encountered in address generation.

FIG. 11 is a logic diagram illustrating a multi-function ALU according to FIG. 10.

FIG. 12 lays out the functions supported by an ALU to collapse interlocks in compounded branching instructions.

FIG. 13 is a logic diagram illustrating an ALU according to FIG. 12.

FIG. 14 illustrates an adder configuration required to collapse interlocks for instructions involving nine operands.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In the discussion which follows, the term "machine cycle" refers to the pipeline steps necessary to execute an instruction. A machine cycle includes individual intervals which correspond to pipeline stages. A "scalar instruction" is an instruction which is executed using scalar operands. Scalar operands are operands representing single-valued quantities. The term "compounding" refers to the grouping of instructions contained in a sequence of instructions, the grouping being for the purpose of concurrent or parallel execution of the grouped instructions. At minimum, compounding is represented by "pairing" of two instructions for simultaneous execution. In the invention being described, compounded instructions are unaltered from the forms they have when presented for scalar execution. As explained below, compounded instructions are accompanied by "tags", that is, bits appended to the grouped instructions which denote the grouping of the instructions for parallel execution. Thus, the bits indicate the beginning and end of a compound instruction.

In the sections to follow, an improved hardware solution to relieve execution unit interlocks that cannot be relieved using prior art techniques will be described. The goal is to minimize the hardware required to relieve these interlocks and to incur only a modest or no penalty to the cycle time from the added hardware. No architectural changes are required to implement this solution; therefore, object code compatibility is maintained for an existing architecture.

The presumed existing architecture is exemplified by a sequential scalar machine such as the System/370 available from the International Business Machines Corporation, the assignee of this application. In this regard, such a system can include the System/370, the System/370 extended architecture (370-XA), and the System/370 enterprise systems architecture (370-ESA). Reference is given here to the Principles of Operation of the IBM System/370, publication number GA22-7000-10, 1987, and to the Principles of Operation, IBM Enterprise Systems Architecture/370, publication number SA22-7200-0, 1988.

The instruction set for these existing System/370 scalar architectures is well known. These instructions are scalar instructions in that they are executed by operations performed on scalar operands. References given hereinbelow to particular instructions in the set of instructions executed by the above-described machines are presented in the usual assembly-level form.

Assume the following sequence of instructions is to be executed by a superscalar machine capable of executing four instructions per cycle:

                  TABLE III
    ______________________________________
    (1) LOAD    R1, X    load the content of X to R1
    (2) ADD     R1, R2   add R1 to R2 and put the result in R1
    (3) SUB     R1, R3   subtract R3 from R1 and put the result in R1
    (4) STORE   R1, Y    store the result in memory location Y
    ______________________________________

Despite the capability of multiple instruction execution per cycle, the superscalar machine will execute the above sequence serially because of instruction interlocks. It has been suggested, based on analysis of program traces, that interlocks occur approximately one third of the time. Thus much of the superscalar machine's resources will be wasted, causing the superscalar's performance to degrade. The superscalar machine performance of interlocked scalar instructions is illustrated by the timing sequence indicated by reference numeral 8 in FIG. 2. In this figure, the pipeline structure for the instructions of Table III is assumed to be as follows:

(1) LOAD: ID AG CA PA

(2) and (3) ADD and SUBTRACT: ID EX PA

where ID is instruction decode and register access, AG is operand address generation, CA represents cache access, EX represents execute, and PA (put away) represents writing the result into a register. To simplify exposition, all examples provided in this description assume, unless explicitly stated, that bypassing is not implemented. In the superscalar machine, the execution of the instruction stream is serialized due to instruction interlocks, reducing the performance of the superscalar to that of a scalar machine.

In FIG. 2, instructions (2) and (3) require no address generation (AG). However, this stage of the pipeline must be accounted for. Hence the unlabeled intervals 7 and 9. This convention holds also for the other three sequences in FIG. 2.

The above example demonstrates that instruction interlocks can constrain the parallelism that is available at the instruction level for exploitation by a superscalar machine. Performance can be gained with pipelining and bypassing of the results of one interlocked instruction to the other; nevertheless, the execution of interlocked instructions must be serialized.

COMPOUND INSTRUCTIONS

If loss of execution cycles due to interlocks is to be avoided, the interlocked instructions must be executed in "parallel" and viewed as a unique instruction. This leads to the concept of a compounded interlocked instruction, a set of scalar instructions that are to be treated as a single unique instruction despite the occurrence of interlocks. A desirable characteristic of the hardware executing a compounded instruction is that its execution requires no more cycles than required by one of the compounded instructions. As a consequence of instruction compounding and its desired characteristics, a compound instruction set machine must view scalar instructions by hardware utilization rather than opcode description.

EXECUTION OF INTERLOCKED INSTRUCTIONS

The concepts of compounded interlocked instructions can be clarified using the ADD and SUB instructions in Table III. These two instructions can be viewed as a unique instruction type because they utilize the same hardware. Consequently they are combined and executed as one instruction. To exploit parallelism their execution requires the execution of:

    R1=R1+R2-R3

in one cycle rather than the execution of the sequence:

    R1=R1+R2

    R1=R1-R3

which requires more than one cycle to execute. The interlock can be eliminated because the add and subtract utilize identical hardware. Moreover, by employing an ALU which utilizes a carry save adder, CSA, and a carry look-ahead adder, CLA, as shown in FIG. 3, the combined instruction R1+R2-R3 can be executed in one cycle provided that the ALU has been designed to execute a three-to-one addition/subtraction function.

As should be evident, the combined form (R1+R2-R3) corresponds to rewriting the two operands of the second instruction in terms of three operands, thereby implying the requirement of an adder which can execute the second instruction in response to three operands.

In FIG. 3, the carry save adder (CSA) is indicated by reference numeral 10. The CSA 10 is conventional in all respects and receives three operands to produce two results, a sum (S) on output 12 and a carry (C) on output 14. For the example given above, the inputs to the CSA 10 are the operands contained in the three registers R1, R2, and R3 (complemented). The outputs of the CSA 10 are staged at 16 and 17 for the provision of a leading "1" or "0" (a "hot" 1 or 0) on the carry value by way of input 20. The value on input 20 is set conventionally according to the function to be performed by the CSA 10.

The sum and carry (with appended 1 or 0) outputs of the CSA 10 are provided as the two inputs to the carry look-ahead adder (CLA) 22. The CLA 22 also conventionally receives a "hot" 1 or 0 on input 24 according to the desired operation and produces a result on 26. In FIG. 3 the result produced by CLA 22 is the combination of the contents of the three registers R1, R2 and R3 (complemented).
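The three-to-one addition described for FIG. 3 can be modeled behaviorally in software. The following is a minimal sketch, not the circuit itself: it assumes 32-bit operands, uses ordinary integer addition as a stand-in for the carry look-ahead adder, and arbitrarily places the single hot 1 needed for one subtraction on the CLA carry-in (input 24) while leaving input 20 at 0. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def carry_save_add(a: int, b: int, c: int):
    """Reduce three 32-bit operands to a sum word and a shifted carry word."""
    sum_word = a ^ b ^ c                              # per-bit sum, no carry propagation
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, shifted left
    return sum_word & MASK32, carry_word & MASK32


def three_to_one(r1: int, r2: int, r3: int, subtract_third: bool = True) -> int:
    """Compute R1 + R2 - R3 (or R1 + R2 + R3) modulo 2**32 in one pass."""
    third = (~r3 & MASK32) if subtract_third else (r3 & MASK32)
    sum_word, carry_word = carry_save_add(r1 & MASK32, r2 & MASK32, third)
    hot_on_carry = 0                       # assumed value on input 20
    hot_on_cla = 1 if subtract_third else 0  # assumed value on input 24
    carry_word |= hot_on_carry             # low bit of the shifted carry is free
    # Plain addition stands in for the carry look-ahead adder.
    return (sum_word + carry_word + hot_on_cla) & MASK32


if __name__ == "__main__":
    print(three_to_one(5, 7, 3))           # 5 + 7 - 3
```

Complementing R3 and supplying one hot 1 yields the 2's complement subtraction, so the combined result is produced without first forming R1+R2 as a separate step.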

Carry save and carry look-ahead adders are conventional components whose structures and functions are well known. Hwang in his COMPUTER ARITHMETIC: Principles, Architecture and Design, 1979, describes carry look-ahead adders at length on pages 88-93 and carry save adders on pages 97-100.

Despite the three-to-one addition requiring an extra stage, the CSA in FIG. 3, in the critical path of the ALU, such a stage should not compromise the cycle time of the machine since the length of other paths usually exceeds that of the ALU. These critical paths are usually found in paths possessing an array access, address generation which requires a three-to-one ALU and a chip crossing; therefore, the extra stage delay is not prohibitive and the proposed scheme will result in performance improvements when compared to scalar or superscalar machines. The performance improvement is shown in FIG. 2 by the set of pipelined plots indicated by reference numeral 26. These plots show the execution of the instruction sequence under consideration by a compound instruction set machine which includes an ALU with an adder configured as illustrated in FIG. 3.

As shown by the timing sequences 8 and 26 of FIG. 2, execution of the sequence by the compound instruction set machine requires eight cycles or two cycles per instruction, CPI, as compared to the 11 cycles or 2.75 CPI achievable by the scalar and superscalar machines. If bypassing is assumed to be supported in all of the machines, plot sets 28 and 30 of FIG. 2 describe the execution achievable with the scalar/superscalar machines and the compound instruction set machine respectively. From these sets, the superscalar machine requires eight cycles or two CPI to execute the example code while the compound instruction set machine requires six cycles or 1.5 CPI. The advantage of the compounded machine over both superscalar and scalar machines should be noted along with the lack of advantage of the superscalar machine over the scalar for the assumed instruction sequence.
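The CPI figures quoted above follow directly from dividing the cycle counts by the four instructions of Table III. The short sketch below reproduces that arithmetic; the cycle counts are taken from the text, and the helper name and dictionary labels are hypothetical.

```python
INSTRUCTION_COUNT = 4  # LOAD, ADD, SUB, STORE of Table III


def cycles_per_instruction(total_cycles: int, instruction_count: int) -> float:
    """Average number of cycles spent per executed instruction."""
    return total_cycles / instruction_count


# Cycle counts as stated in the text for the four timing sequences of FIG. 2.
CASES = {
    "scalar/superscalar, no bypassing": 11,
    "compound, no bypassing": 8,
    "scalar/superscalar, bypassing": 8,
    "compound, bypassing": 6,
}

if __name__ == "__main__":
    for label, cycles in CASES.items():
        print(label, cycles_per_instruction(cycles, INSTRUCTION_COUNT))
```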

Compounding of instructions with their simultaneous execution by hardware is not limited to arithmetic operations. For example, most logical operations can be compounded in a manner analogous to that for arithmetic operations. Also, most logical operations can be compounded with arithmetic operations. Compounding of some instructions, however, could result in stretching the cycle time because unacceptable delays must be incurred to perform the compounded function. For example, an ADD-SHIFT compound instruction may stretch the cycle time prohibitively, which would compromise the overall performance gain. The frequency of interlocks between these instructions, however, is low given the low frequency of occurrence of shift instructions; therefore, they can be executed serially without substantial performance loss.

As described previously, data hazard interlocks occur when a register or memory location is written and then read by a subsequent instruction. The proposed apparatus of this invention collapses these interlocks by deriving new functions that arise from combining the execution of instructions whose operands present data hazards while retaining the execution of functions inherent in the instruction set. Though some instruction and operand combinations would not be expected to occur in a functioning program, all combinations are considered. In general all the functions derived from the above analysis as well as the functions arising from a scalar implementation of the instruction set would be implemented. In practice, however, certain functions arise whose implementation is not well suited to the scheme proposed for this apparatus. The following presentation elucidates these concepts by discussing how new functions arise from combining the execution of two instructions. Examples of instruction sequences that are well handled according to the invention are presented along with some sequences that are not handled well. A logical diagram of the preferred embodiment of the invention is shown.

The apparatus of the invention is proposed to facilitate the parallel issue and execution of instructions. An example of issuing instructions in parallel is found in the superscalar machine of the prior art; the invention of this application facilitates the parallel execution of issued instructions which include interlocks. The use of the data dependency collapsing hardware of this invention, however, is not limited to any particular issue and execution architecture but has general applicability to schemes that issue multiple instructions per cycle.

To provide a hardware platform for the present discussion, a System/370 instruction level architecture is assumed in which up to two instructions can be issued per cycle. The use of these assumptions, however, neither constrains these concepts to a System/370 architecture nor to two-way parallelism. The discussion is broken into sections covering ALU operations, memory address generation, and branch determination.

In general, the System/370 instruction set can be broken into categories of instructions that may be executed in parallel. Instructions within these categories may be combined or compounded to form a compound instruction. The below-described apparatus of the invention supports the execution of compounded instructions in parallel and ensures that interlocks existing between members of a compound instruction will be accommodated while the instructions are simultaneously executed. For example, the System/370 architecture can be partitioned into the categories illustrated in FIGS. 4A and 4B.

Rationale for this categorization was based on the functional requirements of the System/370 instructions and their hardware utilization. The rest of the System/370 instructions are not considered to be compounded for execution in this discussion. This does not preclude them from being compounded on a future compound instruction execution engine and possibly using the conclusions of interlock "avoidance" as presented by the present paper.

Consider the instructions contained in category 1 compounded with instructions from that same category as exemplified in the following instruction sequence:

AR R1,R2

SR R3,R4

This sequence, which is free of data hazard interlocks, produces the results:

    R1=R1+R2

    R3=R3-R4

which comprise two independent instructions specified by the 370 instruction level architecture. Executing such a sequence would require two independent and parallel two-to-one ALU's designed to the instruction level architecture. These results can be generalized to all instruction sequence pairs that are free of data hazard interlocks in which both instructions specify an ALU operation. Two ALU's are sufficient to execute instructions issued in pairs since each instruction specifies at most one ALU operation.

Many instruction sequences, however, are not free of data hazard interlocks. These data hazard interlocks lead to pipeline bubbles which degrade the performance in a typical pipeline design. A solution for increasing processor performance is to eliminate these bubbles from the pipeline by provision of a single ALU that can accommodate data hazard interlocks. To eliminate these interlocks, the ALU must execute new functions arising from instruction pairing and operand conflicts. The functions that arise depend on the ALU operations specified, the sequence of these operations, and operand "conflicts" between the operations (the meaning of the term operand conflicts will become apparent in the following discussion). All instruction sequences that can be produced by pairing instructions that are contained within the compoundable list given earlier in this section and that specify an ALU operation must be analyzed for all possible operand conflicts.

INTERLOCK-COLLAPSING ALU

The general framework for collapsing interlocks according to the invention has been presented above. The following presents a more concrete example of the analyses to be performed in determining the requirements of an interlock collapsing ALU. Assume the existence of a three-to-one adder as described above in reference to FIG. 3. Let OP1 and OP2 represent respectively the first and second of two operations to be executed. For instance, for the following sequence of instructions,

NR R1,R2

AR R3,R4

OP1 corresponds to the operation NR while OP2 corresponds to the operation AR (see below for a description of these operations). Let A10, A11, and A12 represent the inputs corresponding to (R1), (R2), and (R3) respectively of the three-to-one adder in FIG. 3. Consider the analysis of compounding the set of instructions (NR, OR, XR, AR, ALR, SLR, SR), a subset of category 1 as defined in FIGS. 4A and 4B. The operations of this set of instructions are specified by:

    ______________________________________
    NR       Bitwise Logical AND represented by ∧
    OR       Bitwise Logical OR represented by ∨
    XR       Bitwise Exclusive OR represented by ⊕
    AR       32 bit signed addition represented by +
    ALR      32 bit unsigned addition represented by +
    SR       32 bit signed subtraction represented by -
    SLR      32 bit unsigned subtraction represented by -
    ______________________________________

This instruction set can be divided into two sets for further consideration. The first set would include the logical instructions NR, OR and XR, and the second set would include the arithmetic instructions, AR, ALR, SR, and SLR. The grouping of the arithmetics can be justified as follows. The AR and ALR can both be viewed as an implicit 33 bit 2's complement addition by using sign extension for AR and 0 extension for ALR and providing a hot `0` to the adder. Though the setting of condition code and overflow are unique for each instruction, the operation performed by the adder, a binary addition, is common to both instructions. Similarly, SR and SLR can be viewed as an implicit 33 bit 2's complement addition by using sign extension for SR and 0 extension for SLR, inverting the subtrahend, and providing a hot `1` to the adder. The inversion of the subtrahend is considered to be external to the adder. Because the four arithmetic operations essentially perform the same operation, a binary add, they will be referred to as ADD-type instructions while the logical operations will be referred to as LOGICAL-type instructions.
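The grouping of the four arithmetic instructions can be illustrated with a minimal sketch. The function names `extend33` and `add_type` are hypothetical, and the setting of condition code and overflow is omitted; the sketch only shows that one 33 bit binary addition, with sign or 0 extension, an externally inverted subtrahend, and a hot `1` or `0`, yields the 32 bit result of each instruction.

```python
MASK32 = (1 << 32) - 1
MASK33 = (1 << 33) - 1


def extend33(value, signed):
    """Extend a 32 bit operand to 33 bits (sign extension or 0 extension)."""
    value &= MASK32
    if signed and (value >> 31) & 1:
        return value | (1 << 32)
    return value


def add_type(op, a, b):
    """Model AR, ALR, SR and SLR as a single 33 bit binary addition.

    Hypothetical helper: condition code and overflow are not modeled.
    """
    signed = op in ("AR", "SR")        # signed forms use sign extension
    subtract = op in ("SR", "SLR")     # subtract forms invert and add a hot 1
    x = extend33(a, signed)
    y = extend33(b, signed)
    if subtract:
        y = ~y & MASK33                # inversion is external to the adder
    hot = 1 if subtract else 0
    total33 = (x + y + hot) & MASK33   # the common binary addition
    return total33 & MASK32            # 32 bit register result
```

For example, `add_type("SR", 5, 7)` and `add_type("SLR", 5, 7)` both pass through the same addition path and differ only in the extension applied to the operands.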

As a result of the reduction of the above instruction set to two operations, the following sequences of operations must be considered to analyze the compounding of this instruction set:

LOGICAL followed by ADD

ADD followed by LOGICAL

LOGICAL followed by LOGICAL

ADD followed by ADD.

For each of these sequences, all combinations of registers must be considered. The combinations are the case in which all four register specifications are different, plus the number of ways out of four possible register specifications that: 1) two are the same; 2) three are the same; and 3) four are the same. The number of combinations, therefore, can be expressed as:

    N = 1 + _{4}C_{2} + _{4}C_{3} + _{4}C_{4}

where _{n}C_{r} represents n combined r at a time. But,

    _{n}C_{r} = n!/((n-r)! r!)

from which formulas the number of combinations can be found to be 12. These 12 register combinations are:

1. R1≠R2≠R3≠R4

2. R1=R2≠R3≠R4

3. R2=R3≠R1≠R4

4. R2=R4≠R1≠R3

5. R3=R4≠R1≠R2

6. R2=R3=R4≠R1

7. R1=R3≠R2≠R4

8. R1=R4≠R2≠R3

9. R1=R2=R3≠R4

10. R1=R2=R4≠R3

11. R1=R3=R4≠R2

12. R1=R2=R3=R4

Of these combinations, only seven through twelve give rise to data dependency interlocks. The functions produced by the above interlocking cases for the LOGICAL-ADD sequences listed earlier are given in FIG. 5. In this Figure, the LOGICAL-type operations are designated by an ø and the ADD-type operations are denoted by ζ.
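The count of 12 and the selection of combinations seven through twelve can be checked with a short sketch. The encoding is an assumption made for illustration: each combination is written as four letters, one per register specification R1 to R4, with equal letters meaning equal registers, and an interlock is taken to exist when the second instruction (OP2 R3,R4) names R1, the register altered by the first instruction (OP1 R1,R2).

```python
from math import comb

# All different, plus the ways that two, three, or four specifications match.
COUNT = 1 + comb(4, 2) + comb(4, 3) + comb(4, 4)

# Hypothetical encoding of the 12 listed combinations (positions R1..R4).
PATTERNS = {
    1: "abcd", 2: "aabc", 3: "abbc", 4: "abcb",
    5: "abcc", 6: "abbb", 7: "abac", 8: "abca",
    9: "aaab", 10: "aaba", 11: "abaa", 12: "aaaa",
}


def has_interlock(pattern):
    """True when R3 or R4 is the register written by the first instruction."""
    r1, _r2, r3, r4 = pattern
    return r1 in (r3, r4)


INTERLOCKING = [n for n, p in PATTERNS.items() if has_interlock(p)]
```

Under this encoding, `INTERLOCKING` holds the numbers of the combinations in which R1 reappears as R3 or R4, which are the cases that the ALU must collapse.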

While FIG. 5 specifies the operations that must be performed on the operands of ADD-type and LOGICAL-type instructions to collapse the interlocks, FIGS. 6A and 6B specify the ALU operations required to be performed on the ALU inputs AI0, AI1, and AI2 to support all 370 instructions that are contained in the compoundable categories of FIGS. 4A and 4B. In FIGS. 6A and 6B a unary - indicates 2's complement and /x/ indicates the absolute value of x. This Figure was derived using an analysis identical to that given above; however, all possible category compoundings were considered. For the operations of FIG. 5 to be executed by the ALU, the execution unit controls must route the desired register contents to the appropriate inputs of the ALU. FIGS. 7A and 7B summarize the routing of the operands that needs to occur for the ALU defined as in FIGS. 6A and 6B to perform the operations of FIG. 5. Along with these routings, the LOGICAL and ADD-type instructions have been given to facilitate the mapping of these results to FIGS. 6A and 6B. Routings for some ADD-ADD compoundings were not included since these operations require a four input ALU (see "Idiosyncrasies") and are so noted.

While the description thus far has focused the consideration of compound instruction analysis on four specifically-enumerated registers, R1, R2, R3, R4, it should be evident that the practice of the invention is not limited to any four specific registers. Rather, selection of these designations is merely an aid to analysis and understanding. In fact, it should be evident that the analysis can be generalized, as implied by the equations given above.

A logical block diagram illustrating an apparatus for implementing the multifunctional ALU described essentially in FIGS. 5, 6A, 6B, 7A, and 7B is illustrated in FIG. 8. In FIG. 8, a register 50 receives a compound instruction including instructions 52 and 54. The compounded instructions have appended tags 56 and 58. The instructions and their tags are provided to decode and control logic 60 which decodes the instructions and the information contained in their tags to provide register select signals on output 62 and function select signals on output 66. The register select signals on output 62 configure a cross-connect element 64 which is connected to general purpose registers 63 to provide the contents of up to three registers to the three operand inputs AI0, AI1, AI2 of a data dependency collapsing ALU 65. The ALU 65 is a multi-function ALU whose functionality is selected by function select signals provided on the output 66 of the decode and control logic 60. With operands provided from the registers connected through the cross-connect 64, the ALU 65 will perform the functions indicated by the function select signals and produce a result on output 67.

Operating in parallel with the above-described ALU apparatus is a second ALU apparatus including decode and control logic .[.70.]. .Iadd.870 .Iaddend.which decodes the first instruction in the instruction field 52 to provide register select signals to a conventional cross connect 872 which is also connected to the general purpose registers 63. The logic 870 also provides function select signals on output 874 to a conventional two-operand ALU 875. This ALU apparatus is provided for execution of the instruction in instruction field 52, while the second instruction in instruction field 54 is executed by the ALU 65. As described below, the ALU 65 can execute the second instruction whether or not one of its operands depends upon result data produced by execution of the first instruction. Both ALUs therefore operate in parallel to provide concurrent execution of two instructions, whether or not compounded.

Returning to the compounded instructions 52 and 54 and the register 50, the existence of a compounder is presumed. It is asserted that the compounder pairs or compounds the instructions from an instruction stream including a sequence of scalar instructions input to a scalar computing machine in which the compounder resides. The compounder groups instructions according to the discussion above. For example, category 1 instructions (FIG. 5) are grouped in logical/add, add/logical, logical/logical, and add/add pairs in accordance with Table 5. To each instruction of a compound set there is added a tag containing control information. The tag includes compounding bits which refer to the part of a tag used specifically to identify groups of compound instructions. Preferably, in the case of compounding two instructions, the following procedure is used to indicate where compounding occurs. In the System/370 machines, all instructions are aligned on a half word boundary and their lengths are either 2, 4 or 6 bytes. In this case, a compounding tag is needed for every half word. A one-bit tag is sufficient to indicate whether an instruction is or is not compounded. Preferably, a "1" indicates that the instruction that begins in a byte under consideration is compounded with the following instruction. A "0" indicates no compounding. The compounding bit associated with half words that do not contain the first byte of an instruction is ignored. The compounding bit for the first byte of the second instruction in the compound pair is also ignored. Consequently, only one bit of information is needed to identify and appropriately execute compounded instructions. Thus, the tag bits 56 and 58 are sufficient to inform the decode and control logic 60 that the instructions in register fields 52 and 54 are to be compounded, that is, executed in parallel.

The decode and control logic 60 then inspects the instructions 52 and 54 to determine what their execution sequence is, what interlock conditions, if any, obtain, and what functions are required. This determination is illustrated for category 1 instructions in FIG. 5. The decode and control logic also determines the functions required to collapse any data hazard interlock as per FIGS. 6A and 6B. These determinations are consolidated in FIGS. 7A and 7B. In FIGS. 7A and 7B, assuming that the decode and control logic 60 has, from the tag bits, determined that instructions in fields 52 and 54 are to be compounded, the logic 60 sends out a function select signal on output 66 indicating the desired operation according to the left-most column of FIG. 7A. The OP codes of the instructions are explicitly decoded to provide, in the function select output, the specific operations in the columns headed OP1 and OP2 of FIGS. 7A and 7B. The register select signals on output 62 route the registers in FIG. 8 by way of the cross-connect 64 as required in the AI0, AI1, and AI2 columns of FIGS. 7A and 7B. Thus, for example, assume that the first instruction in field 52 is ADD R1,R2, and that the second instruction is ADD R1,R4. The eighteenth line in FIG. 7A shows the ALU operations which the decode and control circuit indicates by OP1=+ and OP2=+, while register R2 is routed to input AI0, register R4 to input AI1, and register R1 to input AI2.
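The one-bit compounding tag described above can be sketched as a scan over the half words of an instruction stream. The function name `find_compound_pairs` and the list-based representation of instruction lengths and tag bits are assumptions made for illustration; the sketch only applies the stated rules that a "1" on the first half word of an instruction pairs it with the following instruction and that all other tag bits are ignored.

```python
def find_compound_pairs(lengths, tags):
    """Group instructions into compound pairs from one tag bit per half word.

    lengths: instruction lengths in bytes (2, 4 or 6), in stream order.
    tags:    one compounding bit for every half word of the stream.
    Returns a list of tuples of instruction indices executed together.
    """
    # Half word offset at which each instruction begins.
    starts = []
    offset = 0
    for length in lengths:
        starts.append(offset)
        offset += length // 2

    groups = []
    i = 0
    while i < len(lengths):
        if tags[starts[i]] == 1 and i + 1 < len(lengths):
            groups.append((i, i + 1))
            i += 2          # the tag of the second instruction is ignored
        else:
            groups.append((i,))
            i += 1
    return groups
```

For a stream of instruction lengths 2, 2, 4, 2 with tag bits 1, 0, 0, 1, 0, the first two instructions form a compound pair; the "1" on the second half word of the 4-byte instruction is ignored because that half word does not hold the first byte of an instruction.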

Refer now to FIG. 9 for an understanding of the structure and operation of the data dependency collapsing ALU 65. In FIG. 9, a three-operand, single-result adder 70, corresponding to the adder of FIG. 3, is shown. The adder 70 obtains inputs through circuits connected between the adder inputs and the ALU inputs AI0, AI1, and AI2. From the input AI2, an operand is routed through three logic functional elements 71, 72 and 73 corresponding to logical AND, logical OR, and logical EXCLUSIVE-OR, respectively. This operand is combined in these logical elements with the operand at AI0 or AI1, as selected by the multiplexer 80. The multiplexer 75 selects either the unaltered operand connected to AI2 or the output of one of the logical elements 71, 72, or 73. The input selected by the multiplexer 75 is provided to an inverter 77, and the multiplexer 78 connects to one input of the adder 70 either the output of the inverter 77 or the uninverted output of the multiplexer 75. The second input to the adder 70 is obtained from ALU input AI1 by way of a multiplexer 82 which selects either "0" or the operand connected to ALU input AI1. The output of the multiplexer 82 is inverted through inverter 84 and the multiplexer 85 selects either the noninverted or the inverted output of the multiplexer 82 as a second operand input to the adder 70. The third input to the adder 70 is obtained from input AI0 which is inverted through inverter 87. The multiplexer 88 selects either "0", the operand input to AI0, or its inverse to be provided as a third input to the adder 70. The ALU output is obtained through the multiplexer 95 which selects the output of the adder 70 or the output of one of the logical elements 90, 92 or 93. The logical elements 90, 92, and 93 combine the output of the adder by means of the indicated logical operation with the operand input to AI1.

It should be evident that the function select signal consists essentially of the multiplexer select signals A B C D E F G and the "hot" 1/0 selections input to the adder 70. It will be evident that the multiplexer select signals range from a single bit for signals A, B, E, and F to two-bit signals for C, D, and G.

The states of the complex control signal (A B C D E F G 1/0 1/0) are easily derived from FIGS. 7A and 7B. For example, following the ADD R1,R2 ADD R1,R4 example given above, the OP1 signal would set multiplexer signal C to select the signal present on AI2, while the F signal would select the noninverted output of the multiplexer 75, thereby providing the operand in R1 to the right-most input of the adder 70. Similarly, the multiplexer signals B and E would be set to provide the operand available at AI1 in uninverted form to the middle input of the adder 70, while the multiplexer signal D would be set to provide the operand at AI0 to the left-most input of the adder 70, without inversion. Last, the two "1/0" inputs are set appropriately for the two add operations. With these inputs, the output of the adder 70 is simply the sum of the three operands, which corresponds to the desired output of the ALU. Therefore, the control signal G would be set so that the multiplexer 95 would output the result produced by the adder 70, which would be the sum of the operands in registers R1, R2, and R4.

When compounding a logical/add sequence, the logical function would be selected by the multiplexer 75 and provided through the multiplexer 78 to the adder 70, while the operand to be added to the result of the logical operation would be guided through one of the multiplexers 85 or 88 to one of the other inputs of the adder 70, with a 0 being provided to the third input. In this case, the multiplexer 95 would be set to select the output of the adder 70 as the result.

Last, in an add/logical compound sequence, the two operands to be first added will be guided to two of the inputs of the adder 70, while the 0 will be provided to the third input. The output of the adder is instantaneously combined with the non-selected operand in logical elements 90, 92 and 93. The control signal G will be set to select the output of the element whose operation corresponds to the second instruction of the compound set.

More generally, FIG. 9 presents a logical representation of the data dependency collapsing ALU 65. In deriving this dataflow, the decision was made to not support interlocks in which the result of the first instruction is used as both operands of the second instruction. More discussion of this can be found in the "Idiosyncrasies" section. That this representation implements the other operations required by LOGICAL-ADD compoundings can be seen by comparing the dataflow with the function column of FIG. 5. In this column, a LOGICAL-type operation upon two operands is followed by an ADD-type operation between the LOGICAL result and a third operand. This is performed by routing the operands to be logically combined to AI0 and AI2 of FIG. 9 and through the appropriate one of logical blocks 71, 72, or 73, routing this result to the adder 70, and routing the third operand through AI1 to the adder. Inversions and provision of hot 1's or 0's are provided as part of the function select signal as required by the arithmetic operation specified. In other cases, an ADD-type operation between two operands is followed by a LOGICAL-type operation between the result of the ADD-type and a third operand. This is performed by routing the operands for the ADD-type operation to AI0 and AI2, routing these inputs to the adder, routing the output of the adder to the post-adder logical blocks 90, 92 and 93, and routing the third operand through AI1 to these post-adder logical blocks.

LOGICAL-type followed by LOGICAL-type operations are performed by routing the two operands for the first LOGICAL-type to AI0 and AI2 which are routed to the pre-adder logical blocks, routing the results from the pre-adder logical blocks through the ALU without modification by addition to zero to the post-adder logical block, and routing the third operand through AI1 to the post-adder logical block. For an ADD-type operation followed by an ADD-type operation, the three operands are routed to the inputs of the adder, and the output of the adder is presented to the output of the ALU.
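The four routings just described can be summarized in a behavioral sketch. This is not the circuit of FIG. 9: the function `collapsing_alu` and its keyword parameters are hypothetical stand-ins for the function select signal, the inverters and hot 1/0 inputs are omitted, and the third operand is assumed to enter through AI1 as in the description of logical elements 90, 92 and 93.

```python
MASK = 0xFFFFFFFF

LOGIC = {
    "NR": lambda a, b: a & b,   # logical AND
    "OR": lambda a, b: a | b,   # logical OR
    "XR": lambda a, b: a ^ b,   # logical EXCLUSIVE-OR
}


def collapsing_alu(ai0, ai1, ai2, pre=None, post=None,
                   add_ai0=False, add_ai1=False):
    """Behavioral model: pre-adder logic, three-input add, post-adder logic."""
    # Pre-adder logical blocks combine AI2 with AI0 when selected.
    right = LOGIC[pre](ai2, ai0) if pre else ai2
    # Each remaining adder input carries either its operand or a zero.
    middle = ai1 if add_ai1 else 0
    left = ai0 if add_ai0 else 0
    total = (left + middle + right) & MASK
    # Post-adder logical blocks combine the adder output with AI1.
    return LOGIC[post](total, ai1) if post else total
```

With the operands of the first instruction on AI0 and AI2 and the third operand on AI1, a LOGICAL-ADD pair selects `pre` and `add_ai1`, an ADD-LOGICAL pair selects `add_ai0` and `post`, a LOGICAL-LOGICAL pair selects `pre` and `post`, and an ADD-ADD pair selects `add_ai0` and `add_ai1`. In each case the single pass yields the same value as executing the two instructions in sequence.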

The operation of the ALU 65 to execute the second instruction in instruction field 54 when there is no data dependency between the first and second instructions is straightforward. In this case, only two operands are provided to the ALU. Therefore, if the second instruction is an add instruction, the two operands will be provided to the adder 70, together with a zero in the place of the third operand, with the output of the adder being selected through the multiplexer 95 as the output of the ALU. If the second instruction is a logical instruction, the logical operation can be performed by routing the two operands to the logical elements 71, 72, and 73, selecting the appropriate output, and then flowing the result through the adder 70 by providing zeros to the other two adder inputs. In this case, the output of the adder would be equal to the logical result and would be selected by the multiplexer 95 as the output of the ALU. Alternatively, one operand can be flowed through the adder by addition of two zeros, which will result in the adder 70 providing this operand as an output. This operand is combined with the other operand in the logical elements 90, 92, and 93, with the appropriate logical element output being selected by the multiplexer 95 as the output of the ALU.

When instructions are compounded as illustrated in FIG. 8, whether or not dependency exists, the instruction in instruction field 52 of register 50 will be conventionally executed by decoding of the instruction 870, 874, selection of its operands by 70, 871, 872, and performance of the selected operation on the selected operands in the ALU 875. Since the ALU 875 is provided for execution of a single instruction, two operands are provided from the selected register through the inputs AI0 and AI1, with the indicated result being provided on the output 877.

Thus, with the configuration illustrated in FIG. 8, the data dependency collapsing ALU 65, in combination with the conventional ALU 875, supports the concurrent (or, parallel) execution of two instructions, even when a data dependency exists between the instructions.

AHAZ-COLLAPSING ALU

Address generation can also be affected by data hazards which will be referred to as address hazards, AHAZ. The following sequence represents a compounded sequence of System/370 instructions that is free of address hazards:

AR R1,R2

S R3,D(R4,R5)

where D represents a three nibble displacement. No AHAZ exists since R4 and R5 which are used in the address calculation were not altered by the preceding instruction. Address hazards do exist in the following sequences:

AR R1,R2

S R3,D(R1,R5)

AR R1,R2

S R3,D(R4,R1)

The above sequences demonstrate the compounding of an RR instruction (category 1 in FIG. 5) with RX instructions (category 9) presenting AHAZ. Other combinations include RR instructions compounded with RS and SI instructions.

For an interlock collapsing ALU, new operations arising from collapsing AHAZ interlocks must be derived by analyzing all combinations of instruction sequences and address operand conflicts. Analysis indicates that common interlocks, such as the ones contained in the above instruction sequences, can be collapsed with a four-to-one ALU.

The functions that would have to be supported by an ALU to collapse all AHAZ interlocks for a System/370 instruction level architecture are listed in FIG. 10. For those cases where four inputs are not specified, an implicit zero is to be provided. The logical diagram of an AHAZ interlock collapsing ALU defined by FIG. 10 is given in FIG. 11. A large subset, but not all, of the functions specified in FIG. 10 are supported by the illustrated ALU. This subset consists of the functions given in rows one to 21 of FIG. 10. The decision as to which functions to include is an implementation decision whose discussion is deferred to the "Idiosyncrasies" section.

As FIG. 11 shows, the illustrated ALU includes an adder 100 in which two three-input, two-output carry save adders 101 and 102 are cascaded with a two-input, single-output carry look ahead adder 103 in such a manner that the adder 100 is effectively a four-operand, single-result adder necessary for operation of the ALU in FIG. 11.
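The cascade of two carry save adders and one carry look ahead adder can be modeled at the word level as follows. The function names are hypothetical, a 32 bit width is assumed, and the carry look ahead adder is represented simply by an ordinary addition; the sketch shows only why the cascade behaves as a four-operand, single-result adder.

```python
MASK = 0xFFFFFFFF  # assumed 32 bit data path


def carry_save_add(a, b, c):
    """Three-input, two-output carry save adder: returns (sum, carry)."""
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum & MASK, carry & MASK


def four_to_one_add(a, b, c, d):
    """Two cascaded carry save adders followed by a two-input adder."""
    s1, c1 = carry_save_add(a, b, c)     # first carry save stage
    s2, c2 = carry_save_add(s1, c1, d)   # second carry save stage
    return (s2 + c2) & MASK              # stands in for the carry look ahead adder
```

Each carry save stage preserves the sum of its inputs modulo 2^32 while reducing three words to two, so no carry propagates until the final two-input addition.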

In generating FIG. 10, the complexity of the ALU structure was simplified at the expense of the control logic. This is best explained by example. Consider the two following System/370 instruction sequences:

NR R1,R2 (4)

S R3,D(R1,R5)

and

NR R1,R2 (5)

S R3,D(R4,R1).

Let the general notation for this sequence be

NR r1,r2

S r3,D(r4,r5).

For the first sequence, the address of the operand is:

    OA=D+(R1∩R2)+R5

while that for the second sequence is:

    OA=D+R4+(R1∩R2)

To simplify the execution controls at the expense of ALU complexity, the following two operations would need to be executed by the ALU:

    OA=AGI0+(AGI1∩AGI2)+AGI3

    OA=AGI0+AGI2+(AGI1∩AGI3)

in which D is fed to AGI0, r2 is fed to AGI1, r4 is fed to AGI2 and r5 is supplied to AGI3. The ALU could be simplified, however, if the controls detect which of r4 and r5 possesses a hazard with r1 and dynamically route this register to AGI2. The other register would be fed to AGI3. For this assumption, the ALU must only support the operation:

    OA=AGI0+(AGI1∩AGI2)+AGI3

Trade-offs such as these are made in favor of reducing the complexity of the address generation ALU as well as the execution and branch determination ALU's.
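The dynamic routing described above can be sketched for the sequence NR r1,r2 followed by S r3,D(r4,r5). The function names and the list `regs` of register contents are hypothetical; address wraparound and any special treatment of register 0 in address generation are not modeled. Register contents are read before the NR executes, since the purpose is to collapse the interlock.

```python
def route_agen_inputs(d, r1, r2, r4, r5, regs):
    """Return (AGI0, AGI1, AGI2, AGI3) with the hazard register on AGI2.

    r1, r2, r4, r5 are register numbers; regs holds register contents.
    """
    if r4 == r1 and r5 == r1:
        # Both address registers depend on r1 (see "Idiosyncrasies").
        raise ValueError("not covered by the single four-input operation")
    if r4 == r1:
        hazard, other = r4, r5
    elif r5 == r1:
        hazard, other = r5, r4
    else:
        raise ValueError("no address hazard to collapse")
    return d, regs[r2], regs[hazard], regs[other]


def operand_address(agi0, agi1, agi2, agi3):
    """The one operation the ALU must support: AGI0+(AGI1 AND AGI2)+AGI3."""
    return agi0 + (agi1 & agi2) + agi3
```

Because the controls place whichever of r4 and r5 conflicts with r1 on AGI2, sequences (4) and (5) both reduce to the same ALU operation, and the second form, AGI0+AGI2+(AGI1∩AGI3), is not needed.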

The ALU of FIG. 11 can be substituted for the ALU 65 in FIG. 8. In this case, the decode and control logic 60 would appropriately reflect the functions of FIG. 10.

BRANCH HAZARD-COLLAPSING ALU

Similar analyses to those for the interlock collapsing ALU's for execution and address generation must be performed to derive the effects of compounding on a branch determination ALU which is given by FIGS. 12 and 13. The branch determination ALU covers functions required by instructions comparing register values. This includes the branch instructions BXLE, BXH, BCT, and BCTR, in which a register value is incremented by the contents of a second register (BXLE and BXH) or is decremented by one (BCT and BCTR) before being compared with a register value (BXLE and BXH) or 0 (BCT and BCTR) to determine the result of the branch. Conditional branches are not executed by this ALU.

The ALU illustrated in FIG. 13 includes a multi-stage adder 110 in which two carry save adders 111 and 112 are cascaded, with the two outputs of the carry save adder 112 providing the two inputs for the carry look ahead adder 113. This combination effectively provides the four-input, single result adder provided for the ALU of FIG. 13.

As an example of the data hazards that can occur, consider the following instruction sequence:

AL R1,D(R2,R3)

BCT R1,D(R2,R3)

Let [x] denote the contents of memory location x. The results following execution are:

    R1=R1+[D+R2+R3]-1

    Branch if (R1+[D+R2+R3])-1=0

This comparison could be done by performing the operation:

    R1+[D+R2+R3]-1.

The results of analyses for the branch determination ALU are provided in FIGS. 12 and 13 without further discussion. The functions supported by the dataflow include those specified by rows one to 25 of FIG. 12.

The ALU of FIG. 13 can be substituted for the ALU 65 in FIG. 8. In this case, the decode and control logic 60 would appropriately reflect the functions of FIG. 12.

IDIOSYNCRASIES

Some of the functions that arise from operand conflicts are more complicated than others. For example, the instruction sequence:

AR R1,R2

AR R1,R1

requires a four-to-one ALU, along with its attendant complexity, to collapse the data interlock because its execution results in:

    R1=(R1+R2)+(R1+R2).

Other sequences result in operations that require additional delay to be incorporated into the ALU in order to collapse the interlock. A sequence which illustrates increased delay is:

SR R1,R2

LPR R1,R1

which results in the operation

    R1=/R1-R2/.

This operation does not lend itself to parallel execution because the results of the subtraction are needed to set up the execution of the absolute value.

Rather than collapse all interlocks in the ALU, an instruction issuing logic or a preprocessor can be provided which is designed to detect instruction sequences that lead to these more complicated functions. Preprocessor detection avoids adding delay to the issue logic which is often a near-critical path. When such a sequence is detected, the issuing logic or preprocessor would revert to issuing the sequence in scalar mode, avoiding the need to collapse the interlock. The decision as to which instruction sequences should or should not have their interlocks collapsed is an implementation decision which is dependent upon factors beyond the scope of this invention. Nevertheless, the trade-off between ALU implementation complexity and issuing logic complexity should be noted.
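One possible form of such detection is sketched below for pairs of RR instructions. The function `issue_mode`, the tuple representation of an instruction, and the particular policy are illustrative assumptions: the policy merely encodes the two examples given above for a three-to-one ALU, and an actual implementation would choose its own set of collapsed sequences.

```python
# Hypothetical set of second instructions whose dependent forms are not
# collapsed because they add delay (absolute value after subtraction).
SERIAL_OPS = {"LPR"}


def issue_mode(first, second):
    """Classify a pair of RR instructions given as (mnemonic, ra, rb).

    Returns "parallel" (no dependency), "collapse" (interlock collapsed
    by the three-to-one ALU) or "scalar" (issued in scalar mode).
    """
    _op1, r1, _r2 = first
    op2, r3, r4 = second
    if op2 in SERIAL_OPS:
        # LPR reads only its second register.
        return "scalar" if r4 == r1 else "parallel"
    if r1 not in (r3, r4):
        return "parallel"
    if r3 == r1 and r4 == r1:
        # The first result feeds both operands: needs a four-to-one ALU.
        return "scalar"
    return "collapse"
```

Under this assumed policy, the sequence AR R1,R2 followed by AR R1,R1 and the sequence SR R1,R2 followed by LPR R1,R1 are both issued in scalar mode, while AR R1,R2 followed by AR R3,R1 is collapsed.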

Hazards present in address generation also give rise to implementation trade-offs. For example, most of the address generation interlocks can be collapsed using a four-to-one ALU as discussed previously. The following sequence

AR R1,R2

S R3,D(R1,R1);

however, does not fit in this category. For this case, a five-to-one ALU is required to collapse the AHAZ interlock because the resulting operation is:

    OA=D+(R1+R2)+(R1+R2)

where OA is the resulting operand address. As before, inclusion of this function in the ALU is an implementation decision which depends on the frequency of the occurrence of such an interlock. Similar results also apply to the branch determination ALU.

GENERALIZATION OF THE ADDER

Analyses similar to those presented can be performed to derive interlock collapsing hardware for the most general case of n interlocks. For this discussion, refer to FIG. 14. Assuming simple data interlocks such as:

AR R1,R2

AR R3,R1

in which the altered register from the first instruction is used as only one of the operands of the second instruction, an (n+1)-to-one ALU would be required to collapse the interlock. To collapse three interlocks, for example, using the above assumption would require a four-to-one ALU. This would also require an extra CSA stage in the ALU.

The increase in the number of CSA stages required in the ALU, however, is not linear. An ALU designed to handle nine operands as a single execution unit would take four CSA stages and one CLA stage. This can be seen from FIG. 14 in which each vertical line represents an adder input and each horizontal line indicates an adder. Carry-save adders are represented by the horizontal lines 200-206, while the carry look-ahead adder is represented by line 209. Each CSA adder produces two outputs from three inputs. The reduction in input streams continues from stage to stage until the last CSA reduces the streams to two. The next adder is a CLA which produces one final output from two inputs. Assuming only arithmetic operations, a one stage CLA adder, and a four stage CSA adder, the execution of nine operands as a single unit using the proposed apparatus could be accomplished, to a first order approximation, in an equivalent time as the solution proposed by Wulf in the reference cited above.
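The non-linear growth in CSA stages can be checked with a short sketch. The function `csa_tree` is hypothetical and assumes that every stage applies as many three-input, two-output carry save adders as the remaining input streams allow, passing any leftover streams through unchanged.

```python
def csa_tree(n_operands):
    """Return (stages, adders) of carry save adders needed to reduce
    n_operands input streams to the two inputs of the final CLA."""
    stages = 0
    adders = 0
    streams = n_operands
    while streams > 2:
        used = streams // 3                  # CSAs working in this stage
        adders += used
        streams = 2 * used + streams % 3     # two outputs per CSA plus leftovers
        stages += 1
    return stages, adders
```

For nine operands the streams reduce as 9, 6, 4, 3, 2, giving four CSA stages and seven carry save adders, which agrees with the seven horizontal lines 200-206 described for FIG. 14. Four operands need two stages, as in the adders of FIGS. 11 and 13.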

Data hazard interlocks degrade the performance obtained from pipelined machines by introducing stalls into the pipeline. Some of these interlocks can be relieved by code movement and instruction scheduling. Another proposal to reduce the degradation in performance is to define instructions that handle data interlocks. This proposal suffers from limitations on the number of interlocks that can be handled in a reasonable instruction size. In addition, this solution is not available for 370 architecture compatible machines.

In this invention, an alternative solution for relieving instruction interlocks has been presented. This invention offers the advantages of requiring no architectural changes, not requiring all possible instruction pairs and their interlocks to be architected into an instruction set, presenting only modest or no impacts to the cycle time of the machine, requiring less hardware than is required in the prior art solution of FIG. 1, and being compatible with System/370-architected machines.

While the invention has been particularly shown and described with reference to the preferred embodiment thereof, it will be understood by those skilled in the art that many changes in form and details may be made therein without departing from the spirit and scope of the invention.

What is claimed is:
 1. In a computer architected for serial execution of a sequence of single scalar instructions in a succession of execution cycles, an apparatus for supporting parallel execution of a plurality of scalar instructions in a single instruction cycle, the apparatus comprising: an instruction means for receiving a plurality of scalar instructions, a first of the scalar instructions producing a .Iadd.calculation .Iaddend.result used as an operand by a second of the scalar instructions; an operand means for substantially simultaneously providing a plurality of operands, at least two of said operands being used by the first and second scalar instructions; a control means connected to the instruction means for generating control signals to indicate operations which execute the plurality of scalar instructions; and an execution means connected to the operand means and to the control means and responsive to the control signals and to a plurality of operands including the two operands for producing, in a single execution cycle, a single result corresponding to the performance of said operations on said plurality of operands.
 2. The apparatus of claim 1, wherein the execution means includes an adder which produces a single adder result in response to three operands.
 3. The apparatus of claim 2, wherein the adder includes a carry save adder which produces two outputs in response to the three operands and a carry look ahead adder, connected to the carry save adder, which produces one output in response to the two outputs of the carry save adder.
 4. The apparatus of claim 2, wherein the execution means further includes logical means connected to the operand means and to the adder for performing a logic function on the operands to produce a logic result, the adder producing said single adder result in response to the logic result and one of the operands.
 5. The apparatus of claim 2, wherein the execution means further includes logic means connected to the operand means and to the adder for performing a logic function on a first and second operand to produce a logic result, the execution means producing the single result in response to the logic result and the single adder result.
 6. The apparatus of claim 1 wherein the first scalar instruction is a logical instruction and the second scalar instruction is an arithmetic instruction and the execution means includes logical means for combining first and second operands to produce a logical result required by said logical instruction and arithmetic means for combining the logical result with a third operand to produce said single result, said single result being required by the arithmetic instruction.
 7. The apparatus of claim 1 wherein the first scalar instruction is an arithmetic instruction and the second scalar instruction is a logical instruction and the execution means includes arithmetic means for combining first and second operands to produce an arithmetic result required by said arithmetic instruction and logical means for combining the arithmetic result with a third operand to produce said single result, said single result being required by the logical instruction.
 8. The apparatus of claim 1, wherein the first scalar instruction is an arithmetic instruction and the second scalar instruction is an arithmetic instruction and the execution means includes arithmetic means for combining the three operands to produce a single arithmetic result, said single arithmetic result being provided as said single result.
 9. The apparatus of claim 1 wherein the first scalar instruction is a logical restruction and the second scalar instruction is a logical instruction and the execution means includes logical means for combining first and second operands to produce a first logical result, said first logical result required by said first logical instruction, and second logical means for combining the first logical result with a third operarand to produce a second logical result, said second logical result being required by the second scalar instruction and said second logical result being provided as said single result.
 10. A multifunction ALU (arithmetic logic unit) for combining three operands to produce a single result in response to a pair of instructions, including: a first set of logical elements for logically combining two operands to produce a first logical result; an adder for arithmetically combining three operands to produce a single arithmetic result; a circuit for inputting to the adder either all of said operands, two of said operands and a zero, one of said operands, a zero, and said first logical result, or two zeros and said first logical result; a second set of logical elements for logically combining one of said operands with said single arithmetic result to produce a second logical result; and a circuit for providing as an output either said arithmetic result or said second logical result.
 11. The multifunction ALU of claim 10, wherein said adder includes: a carry-save adder for producing two outputs in response to three operands; and a carry look ahead adder connected to said carry save adder for producing one output in response to said two outputs.
 12. In a computer architected for serial execution of a sequence of scalar instructions in a succession of execution periods, an interlock-collapsing apparatus for supporting simultaneous parallel execution of a plurality of scalar instructions, the apparatus comprising: an instruction register means for receiving a plurality of scalar instructions for simultaneous execution, a first instruction of the plurality of scalar instructions producing a result used as an operand by a second instruction of the plurality of scalar instructions; an operand means for substantially simultaneously providing a plurality of operands used in executing the plurality of scalar instructions; a control means connected to the instruction register means for generating control signals to indicate operations which execute the plurality of scalar instructions; and an interlock-collapsing execution means connected to the operand means and to the control means .[.an.]. .Iadd.and .Iaddend.responsive to the control signals and to the plurality of operands for producing a single result corresponding to the simultaneous execution of .Iadd.said .Iaddend.first and second instructions in a single execution period.
 13. The apparatus of claim 12, wherein the interlock-collapsing execution means includes an adder which produces a single adder result in response to three operands.
 14. The apparatus of claim 13, wherein the adder includes a carry save adder which produces two outputs in response to the three operands and a carry lookahead adder connected to the carry save adder which produces one output in response to the two outputs of the carry save adder.
 15. The apparatus of claim 13, wherein the interlock-collapsing execution means further includes logic means connected to the operand means and to the adder for performing a logic function on the operands to produce a logic result, the adder producing the single adder result in response to the logic result and one of the operands.
 16. The apparatus of claim 13, wherein the interlock-collapsing execution means further includes logic means connected to the operand means and to the adder for performing a logic function on a first operand and a second operand to produce a logic result, the interlock-collapsing execution means producing the single result in response to the logic result and the single adder result.
 17. The apparatus of claim 12, wherein the first instruction is a logical instruction and the second instruction is an arithmetic instruction and the interlock-collapsing execution means includes logical means for combining first and second operands to produce a logic result required by the logical instruction and arithmetic means for combining the logic result with a third operand to produce the single result, the single result representing execution of the arithmetic instruction.
 18. The apparatus of claim 12, wherein the first instruction is an arithmetic instruction and the second instruction is a logic instruction and the interlock-collapsing execution means includes arithmetic means for combining first and second operands to produce an arithmetic result required by said arithmetic instruction and logic means for combining the arithmetic result with a third operand to produce the single result, the single result representing execution of the logical instruction.
 19. The apparatus of claim 12, wherein the first instruction is an arithmetic instruction and the second instruction is an arithmetic instruction and the interlock-collapsing execution means includes arithmetic means for combining three operands to produce a single arithmetic result, the three operands including two operands used in the execution of the first and second instructions.
 20. The apparatus of claim 12, wherein the first instruction is a first logic instruction and the second instruction is a second logic instruction and the interlock-collapsing execution means includes logic means for combining first and second operands to produce a first logic result, the first logic result being required by the first logic instruction, and second logic means for combining the first logic result with a third operand to produce a second logic result, the second logic result representing execution of the second logic instruction and the second logic result being provided as the single result.
 21. In a computer architected for serial execution of a sequence of scalar instructions in a succession of execution cycles, an execution apparatus for, in a single execution cycle, producing a result representing simultaneous execution of a first scalar instruction and a second scalar instruction in which the second scalar instruction requires a result produced by execution of the first scalar instruction, the execution apparatus comprising: an instruction register means for receiving the first and second scalar instructions; an operand means for substantially simultaneously providing a plurality of operands, at least two of the plurality of operands being used in executing the first and second scalar instructions; a control means connected to the instruction register means for generating control signals which indicate execution of the first scalar instruction and the second scalar instruction; a first execution means connected to the operand means and to the control means and responsive to the control signals and to the two operands for producing, in an execution cycle, a result corresponding to the execution of the first instruction; and a second execution means connected to the operand means and to the control means and responsive to the control signals and to a plurality of operands including the two operands for producing, in said execution cycle, a single result corresponding to the execution of the first and second instructions.
 22. The apparatus of claim 21, wherein the first execution means includes an adder which produces a single adder result in response to two operands.
 23. The apparatus of claim 21, wherein the second execution means includes an adder which produces a single adder result in response to three operands.
 24. The apparatus of claim 23, wherein the adder includes a carry save adder which produces two outputs in response to the three operands and a carry lookahead adder connected to the carry save adder, which produces one output in response to the two outputs of the carry save adder.
 .Iadd.25. In a computer system, an apparatus for supporting parallel execution of a plurality of instructions in a single execution cycle, the apparatus comprising: an instruction means for receiving a plurality of instructions, a first of the instructions producing a calculation result used as an operand by a second of the instructions; an operand means for substantially simultaneously providing a plurality of operands; a control means connected to the instruction means for generating control signals to indicate operations which execute the plurality of instructions; and an execution means connected to the operand means and to the control means and responsive to the control signals and to the plurality of operands for producing a single result corresponding to the performance of said operations, including execution of the first and second of the instructions, on said plurality of operands in a single execution cycle. .Iaddend.
 .Iadd.26. The apparatus of claim 25, wherein the execution means includes an adder which produces a single adder result in response to three operands in the single execution cycle. .Iaddend.
 .Iadd.27. The apparatus of claim 26, wherein the adder includes a carry save adder which produces two outputs in response to the three operands and a carry look ahead adder, connected to the carry save adder, which produces one output in response to the two outputs of the carry save adder. .Iaddend.
 .Iadd.28. The apparatus of claim 26, wherein the execution means further includes logical means connected to the operand means and to the adder for performing a logic function on the operands to produce a logic result, the adder producing said single adder result in response to the logic result and one of the operands. .Iaddend.
 .Iadd.29. The apparatus of claim 26, wherein the execution means further includes logic means connected to the operand means and to the adder for performing a logic function on a first and second operand to produce a logic result, the execution means producing the single result in response to the logic result and the single adder result. .Iaddend.
 .Iadd.30. The apparatus of claim 25, wherein the first instruction is a logical instruction and the second instruction is an arithmetic instruction and the execution means includes logical means for combining first and second operands to produce a logical result required by said logical instruction and arithmetic means for combining the logical result with a third operand to produce said single result, said single result being required by the arithmetic instruction. .Iaddend.
 .Iadd.31. The apparatus of claim 25 wherein the first instruction is an arithmetic instruction and the second instruction is a logical instruction and the execution means includes arithmetic means for combining first and second operands to produce an arithmetic result required by said arithmetic instruction and logical means for combining the arithmetic result with a third operand to produce said single result, said single result being required by the logical instruction. .Iaddend.
 .Iadd.32. The apparatus of claim 25, wherein the first instruction is an arithmetic instruction and the second instruction is an arithmetic instruction and the execution means includes arithmetic means for combining the three operands to produce a single arithmetic result, said single arithmetic result being provided as said single result. .Iaddend.
 .Iadd.33. The apparatus of claim 25, wherein the first instruction is a logical instruction and the second instruction is a logical instruction and the execution means includes logical means for combining first and second operands to produce a first logical result, said first logical result required by said first logical instruction, and second logical means for combining the first logical result with a third operand to produce a second logical result, said second logical result being required by the second instruction and said second logical result being provided as said single result. .Iaddend.
 .Iadd.34. In a computer system, an interlocking-collapsing apparatus for supporting simultaneous parallel execution of a plurality of instructions, the apparatus comprising: an instruction register means for receiving a plurality of instructions for simultaneous execution, a first instruction of the plurality of instructions producing a result used as an operand by a second instruction of the plurality of instructions; an operand means for substantially simultaneously providing a plurality of operands used in executing the plurality of instructions; a control means coupled to the instruction register means for generating control signals to indicate operations which execute the plurality of instructions; and an interlock-collapsing execution means coupled to the operand means and to the control means and responsive to the control signals and to the plurality of operands for producing a result corresponding to the simultaneous execution of first and second instructions in a single execution period. .Iaddend.
 .Iadd.35. The apparatus of claim 34, wherein the interlock-collapsing execution means includes a carry save adder which produces two outputs in response to receiving three operands and a carry look ahead adder connected to the carry save adder which produces a single adder result in the execution period. .Iaddend.
 .Iadd.36. The apparatus of claim 35 further including logic means connected to the operand means and to the interlock-collapsing execution means for performing a logic function on the operands to produce a logic result, the interlock-collapsing execution means including an adder which produces the single adder result in response to the logic result and one of the operands. .Iaddend.
 .Iadd.37. The apparatus of claim 35 further including logic means connected to the operand means and to the interlock-collapsing execution means for performing a logic function on a first operand and a second operand to produce a logic result, the interlock-collapsing execution means producing the result in response to the logic result and the single adder result. .Iaddend.
 .Iadd.38. The apparatus of claim 34, wherein the interlock-collapsing execution means includes a carry save adder which produces two outputs in response to receiving three operands and a carry look ahead adder connected to the carry save adder which produces one output in response to the two outputs of the carry save adder. .Iaddend.
 .Iadd.39. The apparatus of claim 34, wherein the first instruction is a logical instruction and the second instruction is an arithmetic instruction and the interlock-collapsing execution means includes logic means for combining first and second operands to produce a logic result required by the logical instruction and arithmetic means for combining the logic result with a third operand to produce a single result, the single result representing execution of the arithmetic instruction. .Iaddend.
 .Iadd.40. The apparatus of claim 34, wherein the first instruction is an arithmetic instruction and the second instruction is a logic instruction and the interlock-collapsing execution means includes arithmetic means for combining first and second operands to produce an arithmetic result required by said arithmetic instruction and logic means for combining the arithmetic result with a third operand to produce a single result, the single result representing execution of the logical instruction. .Iaddend.
 .Iadd.41. The apparatus of claim 34, wherein the first instruction is an arithmetic instruction and the second instruction is an arithmetic instruction and the interlock-collapsing execution means includes arithmetic means for combining three operands to produce a single arithmetic result, the three operands including two operands used in the execution of the first and second instructions. .Iaddend.
 .Iadd.42. The apparatus of claim 34, wherein the first instruction is a first logic instruction and the second instruction is a second logic instruction and the interlock-collapsing execution means includes logic means for combining first and second operands to produce a first logic result, the first logic result being required by the first logic instruction, and second logic means for combining the first logic result with a third operand to produce a second logic result, the second logic result representing execution of the second logic instruction and the second logic result being provided as the single result. .Iaddend.
 .Iadd.43. In a computer system, an execution apparatus for, in a single execution cycle, producing a result representing simultaneous execution of a first instruction and a second instruction in which the second instruction requires a result produced by execution of the first instruction, the execution apparatus comprising: an instruction register means for receiving the first and second instructions; an operand means for substantially simultaneously providing a plurality of operands, at least two of the plurality of operands being used in execution of the first and second instructions; a control means connected to the instruction register means for generating control signals which indicate execution of the first instruction and the second instruction; a first execution means connected to the operand means and to the control means and responsive to the control signals and to the two operands for producing, in an execution cycle, a result corresponding to the execution of the first instruction; and a second execution means connected to the operand means and to the control means and responsive to the control signals and to a plurality of operands including the two operands for producing, in said execution cycle, a single result corresponding to the execution of the first and second instructions. .Iaddend.
 .Iadd.44. The apparatus of claim 43, wherein the first execution means includes an adder which produces a single adder result in response to two operands. .Iaddend.
 .Iadd.45. The apparatus of claim 43, wherein the second execution means includes an adder which produces a single adder result in response to three operands. .Iaddend.
 .Iadd.46. The apparatus of claim 45, wherein the adder includes a carry save adder which produces two outputs in response to the three operands and a carry look ahead adder connected to the carry save adder, which produces one output in response to the two outputs of the carry save adder. .Iaddend.
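By way of illustration only, the data path recited in claim 10 (a first set of logical elements, a three-operand adder built from a carry save adder feeding a second adder stage, and a second set of logical elements) may be modeled behaviorally as in the following sketch. The 32-bit width, the operation names ADD, AND, OR and XOR, and the function names are assumptions made for this example and do not appear in the claims. The second adder stage is modeled with ordinary integer addition rather than a true carry lookahead network, and the sketch does not attempt to represent gate-level timing.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit datapath; the claims do not fix a width

# Hypothetical logical operations; the claims only speak of "a logic function".
LOGIC_OPS = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}


def carry_save_add(a, b, c):
    """3-to-2 compression: three operands in, two outputs (sum, carry) out."""
    sum_bits = a ^ b ^ c                              # per-bit sum, no carry ripple
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, shifted left
    return sum_bits & MASK, carry_bits & MASK


def three_operand_add(a, b, c):
    """Carry save adder followed by a two-input adder (stand-in for carry lookahead)."""
    sum_bits, carry_bits = carry_save_add(a, b, c)
    return (sum_bits + carry_bits) & MASK


def collapsed_alu(first_op, second_op, a, b, c):
    """Single result for the dependent pair: t = a first_op b; r = t second_op c."""
    # Select adder inputs: all operands, two operands and a zero,
    # the first logical result with an operand and a zero, or with two zeros.
    third = c if second_op == "ADD" else 0
    if first_op == "ADD":
        adder_inputs = (a, b, third)
    else:
        adder_inputs = (LOGIC_OPS[first_op](a, b), 0, third)
    arithmetic_result = three_operand_add(*adder_inputs)
    # Output either the arithmetic result or the second logical result.
    if second_op == "ADD":
        return arithmetic_result
    return LOGIC_OPS[second_op](arithmetic_result, c)


def serial_reference(first_op, second_op, a, b, c):
    """Reference model: execute the two instructions one after the other."""
    def run(op, x, y):
        return (x + y) & MASK if op == "ADD" else LOGIC_OPS[op](x, y)
    return run(second_op, run(first_op, a, b), c)
```

For any pair of operations drawn from the assumed set, collapsed_alu is intended to return the same value as serial_reference, which corresponds to the four instruction-pair categories (arithmetic then arithmetic, logical then arithmetic, arithmetic then logical, logical then logical) recited in the dependent claims.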